Car Evaluation

Lab Assignment One: Exploring Table Data


Richmond Aisabor

Business Understanding

Cars are an essential part of our daily lives because they provide a safe and affordable means of transportation. Depending on where you live, there may be no alternative to car ownership. Even though cars provide transportation, they are usually viewed as a liability, so the decision to buy a car is important and has implications for one's standard of living.

In this study, the Car Evaluation Dataset will be used to build a model that can classify cars according to their quality. The dataset has 1728 observations and 6 features, all of them categorical, with levels expressed either as text labels or as counts. The dataset was derived from a hierarchical decision model, and each feature can be placed into one of three concepts: Price, Tech, and Comfort. These concepts form the decision-making system. The dataset fulfills the lab requirements and is free to download from the UC Irvine Machine Learning Repository.

The user who would benefit most from a successful classification is a first-time car buyer. First-time buyers usually have minimal experience with cars and must decide whether to buy with incomplete information. The information they do receive comes from a sales representative, who has an incentive to sell the car regardless of how well it actually performs. If a tool can determine which car is best, the buyer can purchase with far more confidence. To be confident that the algorithm is learning properly, its accuracy must exceed 50%, which is already above the 25% expected from uniform random guessing among the four quality classes. The goal is for the algorithm to get as close to 100% accuracy as possible.
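As a rough check on the accuracy target, the sketch below compares two naive baselines: guessing uniformly among the quality classes and always predicting the most common class. The class counts are the ones listed in the UCI dataset description and are an assumption here, not a result computed in this report; they should be verified against the loaded data.

```python
from collections import Counter

# Class counts as listed in the UCI dataset description (assumed, not computed here).
class_counts = Counter({"unacc": 1210, "acc": 384, "good": 69, "vgood": 65})

total = sum(class_counts.values())

# Accuracy expected from guessing each class with equal probability.
uniform_guess = 1 / len(class_counts)

# Accuracy from always predicting the most frequent class.
majority_class, majority_count = class_counts.most_common(1)[0]
majority_baseline = majority_count / total

print(f"uniform guess: {uniform_guess:.3f}")
print(f"always '{majority_class}': {majority_baseline:.3f}")
```

If these counts hold, always predicting the majority class is a stricter benchmark than the 50% threshold, so it may be worth reporting model accuracy against it as well.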

Data Understanding

Data Description

A review of the dataframe summary shows that there are no missing values in the dataset: every feature column holds 1728 entries, which matches the dataset's total of 1728 observations.

If there were missing values, an imputation procedure could be used to correct the missing entries. Mean and median imputation can be used for numerical values, while mode can be used for categorical values.
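A minimal sketch of the missing-value check and mode imputation described above, using only the standard library. The rows, the column names, and the helper functions `count_missing` and `impute_mode` are hypothetical and exist only for illustration; the real dataset has no missing entries.

```python
from statistics import mode

# Hypothetical rows with two missing entries (None) for illustration only.
rows = [
    {"buying": "high", "safety": "low"},
    {"buying": None, "safety": "med"},
    {"buying": "high", "safety": "high"},
    {"buying": "low", "safety": None},
    {"buying": "high", "safety": "med"},
]


def count_missing(rows):
    """Return the number of missing (None) entries per column."""
    columns = rows[0].keys()
    return {col: sum(row[col] is None for row in rows) for col in columns}


def impute_mode(rows):
    """Fill each missing categorical entry with its column's most common value."""
    columns = rows[0].keys()
    fill = {
        col: mode(row[col] for row in rows if row[col] is not None)
        for col in columns
    }
    return [
        {col: row[col] if row[col] is not None else fill[col] for col in columns}
        for row in rows
    ]


print(count_missing(rows))
imputed = impute_mode(rows)
print(count_missing(imputed))
```

For numerical columns, the same structure would apply with `statistics.mean` or `statistics.median` in place of `mode`.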

The features to discard are number_doors and persons because they are not useful enough for determining the quality of a car. How many people a car can carry and how many doors it has both indicate the size of the vehicle. Vehicle size is a matter of personal preference: a smaller vehicle isn't necessarily less valuable than a larger one, and vice versa.

The table above shows each feature's description, scale, and range.

Data Quality

The duplicate check shows that 1506 entries in the dataset are duplicates. With only 1728 observations in total, such a high number of duplicated entries was alarming at first. However, all the features in this dataset are categorical, so a large number of duplicates should be expected. If the features contained continuous values, duplicates would be much rarer because the range of possible values for each entry would be far larger.
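The counting argument behind this can be sketched as follows. The sketch assumes that the dataset contains every combination of feature levels exactly once (as the UCI documentation states) and that the duplicate check ran after number_doors and persons were discarded, which this report does not state explicitly. The level names follow the UCI documentation; the column names buying, maintenance, and safety are illustrative.

```python
from itertools import product

# Feature levels as listed in the UCI documentation; some column names are illustrative.
LEVELS = {
    "buying": ["vhigh", "high", "med", "low"],
    "maintenance": ["vhigh", "high", "med", "low"],
    "number_doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "boot_space": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Assumption: every combination of levels appears exactly once in the dataset.
full_grid = list(product(*LEVELS.values()))

# Keep only the columns that remain after discarding number_doors and persons.
names = list(LEVELS)
kept_positions = [
    i for i, name in enumerate(names) if name not in ("number_doors", "persons")
]
reduced_rows = [tuple(row[i] for i in kept_positions) for row in full_grid]

n_rows = len(full_grid)
n_unique_all_features = len(set(full_grid))
n_unique_kept_features = len(set(reduced_rows))
n_repeats_kept_features = n_rows - n_unique_kept_features

print(n_rows, n_unique_all_features, n_unique_kept_features, n_repeats_kept_features)
```

Under these assumptions the six features alone produce no repeated rows, while the four remaining features produce 1584 repeats. The reported figure of 1506 is lower, which would be consistent with the quality label being included in the check, since rows that share the same four features can still differ in quality.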

Data Visualization

Data Exploration

Average Maintenance Price by Quality

The "Average Maintenance Price by Quality" plot shows that maintenance price generally decreases as quality increases. The lowest maintenance price occurs at a quality of good, and the second lowest at very good. The slight rise from good to very good could be because cars of very good quality are newer vehicles with more features (luxury vehicles) and are therefore more expensive to maintain. Unacceptable and acceptable quality have the highest maintenance prices, with acceptable only slightly lower than unacceptable.

Average Safety by Quality

The "Average Safety by Quality" shows that as quality increases the safety of the car also increases. There is a positive correlation between safety and quality. Driving a car is one of the most dangerous activities, so safety is arguably the most important aspect of car ownership and should be an indicator of how good a car is at being a car.
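The grouped averages in these plots depend on first mapping each ordinal level to a number. The sketch below assumes the encoding low = 1, med = 2, high = 3 for safety and uses a handful of hypothetical records, not the real data; `average_by_quality` is an illustrative helper.

```python
from collections import defaultdict
from statistics import mean

# Assumed ordinal encoding for safety levels.
SAFETY_CODES = {"low": 1, "med": 2, "high": 3}

# Hypothetical (quality, safety) records for illustration only.
sample = [
    ("unacc", "low"),
    ("unacc", "med"),
    ("acc", "med"),
    ("acc", "high"),
    ("vgood", "high"),
]


def average_by_quality(records, codes):
    """Return the mean encoded feature value for each quality class."""
    grouped = defaultdict(list)
    for quality, level in records:
        grouped[quality].append(codes[level])
    return {quality: mean(values) for quality, values in grouped.items()}


averages = average_by_quality(sample, SAFETY_CODES)
print(averages)
```

The same helper would apply to maintenance price with a different code table, for example one that maps four price levels to the integers 1 through 4.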

Average Safety and Maintenance Price by Quality

The "Average Safety and Maintenance Price by Quality" plot shows that as safety and maintenance price increase, there are fewer cars of unacceptable quality. All cars with a safety rating of 1 (low) are of unacceptable quality, and cars in this group evaluate to unacceptable regardless of the maintenance price. At a safety rating of 3 (high), the majority of cars evaluate to acceptable, good, or very good quality, and as the maintenance price goes down, more cars evaluate to good or very good.

Data Relationship Exploration

Correlation Matrix

According to the correlation matrix, boot space and safety are positively correlated with quality. This means that the safer a car is and the more boot space it has, the higher its quality tends to be. Buying price and maintenance price are negatively correlated with quality: as the cost to buy or maintain a car goes up, the car's quality rating tends to go down.
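For reference, each entry of a correlation matrix can be computed as a Pearson coefficient between two encoded columns. The sketch below assumes integer-encoded features and uses hypothetical values chosen only to show one positive and one negative correlation; it does not reproduce the report's matrix.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical integer-encoded columns for illustration only.
safety = [1, 1, 2, 2, 3, 3]
maintenance = [4, 3, 4, 2, 2, 1]
quality = [0, 0, 0, 1, 1, 3]

safety_vs_quality = pearson(safety, quality)
maintenance_vs_quality = pearson(maintenance, quality)

print(f"safety vs quality: {safety_vs_quality:+.2f}")
print(f"maintenance vs quality: {maintenance_vs_quality:+.2f}")
```

Because the features are ordinal, a rank-based coefficient such as Spearman's would be a reasonable alternative; Pearson is used here only because it is the common default for correlation matrices.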

Scatter plots

The scatter plots show that buying price is negatively correlated with quality. This is unexpected because higher-quality cars usually cost more; the expectation was a positive correlation between buying price and quality. A higher-quality car typically has more safety features, a bigger boot space, and so on, and the more features a car has, the more expensive it is to purchase. Buying price is not a great classifier for my target "quality", so I need to transform the features into two principal components using PCA.

Dimensionality Reduction

Principal Component Analysis

Explained Variance

According to the scree plot, each principal component accounts for 25% of the variation in the data. The scatter plot obtained from principal component analysis uses two principal components, which together account for 50% of the variation in the data. To provide a good representation of our dataset, four principal components are needed.
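One possible explanation for the equal 25% shares is that the four remaining features are mutually uncorrelated. The sketch below checks this under several assumptions that are not stated in this report: the features are integer-encoded, they are standardized before PCA, and the dataset covers every combination of levels equally often. When all off-diagonal correlations are zero, the correlation matrix is diagonal, so its eigenvalues are simply its diagonal entries.

```python
from itertools import product
from math import sqrt

# Number of levels per remaining feature (column names partly illustrative).
LEVEL_COUNTS = {"buying": 4, "maintenance": 4, "boot_space": 3, "safety": 3}

# Balanced grid: every combination of integer-encoded levels appears once.
rows = list(product(*(range(1, n + 1) for n in LEVEL_COUNTS.values())))
columns = [list(col) for col in zip(*rows)]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


size = len(columns)
corr = [[pearson(a, b) for b in columns] for a in columns]
max_off_diagonal = max(
    abs(corr[i][j]) for i in range(size) for j in range(size) if i != j
)

# With zero off-diagonal entries, the eigenvalues are the diagonal entries.
eigenvalues = [corr[i][i] for i in range(size)]
explained_ratio = [value / sum(eigenvalues) for value in eigenvalues]
cumulative = [sum(explained_ratio[: k + 1]) for k in range(size)]

print(max_off_diagonal)
print(explained_ratio)
print(cumulative)
```

Under these assumptions no component captures more variance than any single original feature, which would explain why two components reach only 50% and why all four are needed to represent the data fully.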

References

UCI Machine Learning Repository. Car Evaluation. https://archive.ics.uci.edu/ml/datasets/Car+Evaluation (Accessed 03-03-2021)